ABSTRACT

The simplicity of producing and consuming online content
makes it difficult to estimate how much attention Internet
users will devote to any given content. This work presents
a general overview of temporal patterns in the access to
content on a huge collaborative platform. We propose a model
for predicting the popularity of promoted content, inspired
by the analysis of the page-view dynamics on Wikipedia.
Compared to previous studies, the observed popularity
patterns are more complex; however, our model uses just a
few parameters to fully describe them. The model is
validated through empirical measurements.

Categories and Subject Descriptors 

H.5.3 [Information Interfaces]: Group and Organization
Interfaces - Computer-supported cooperative work, Web-based
interaction

General Terms 

Human Factors, Measurement, Theory 

Keywords 

Wikipedia, promoted content, temporal patterns, popularity 
prediction 



1. INTRODUCTION 

The social media boom gave birth to a wide range of
studies about online traces generated by Internet users. One
of the important research targets addressed by these studies
is the analysis and prediction of the dynamics of content
popularity. Historically, most studies focused on the
analysis of content generated on blogging [3][7], later
microblogging [9], video-sharing [19] and news-sharing
platforms [5]. However, in many cases the studies reflect only
the behavior of registered users or focus on a website of
interest only for a specific community, e.g. Slashdot [5]. Here
we instead analyze a website of general interest and address
the problem of understanding online usage and popularity
patterns through a large-scale analysis of the visitors and
the users of Wikipedia, the sixth most visited website.

Wikipedia is a free, collaboratively edited and multilingual
Internet encyclopedia. It has an estimated 365 million
monthly readers worldwide. Although many studies have
examined editing and commenting activity on Wikipedia
[6][13][17][22], there are not many quantitative works
focusing on the usage of Wikipedia by Internet users. To the
best of our knowledge, only a few studies explore Wikipedia
views as an information source in order to detect and predict
real-world events. Osborne et al. [12] used a stream of
Wikipedia page views to improve the quality of events
discovered on Twitter, and Mestyan et al. [11] predicted the
popularity of a movie by measuring the activity level of
editors and viewers of the corresponding Wikipedia entry.
Finally, other authors analyzed how the Wikipedia traffic
data is influenced by external and internal events.

One of the main goals of this work is to examine content
popularity on Wikipedia. As on many online platforms, some
Wikipedia articles get promoted to the Main page. In Figure 1
we present an example of the Main page of the English
Wikipedia. Every Wikipedia user can nominate any article to
the pool of possible future promoted articles (featured
articles) on a specific page.

We refer to the article placed under the headline "From
today's featured article" as promoted, in order to avoid
confusion with featured articles in general. Every day, at
00h (UTC), a new promoted article is placed on the Main
page together with links to the three articles promoted
during the previous three days (see "Recently featured" at the
bottom left of Figure 1). The "today's promoted" articles
are also sent by e-mail to subscribers at 04h (UTC).

low but positive correlations when only articles with
non-zero activities are considered.

Finally, we identify a specific popularity pattern for Wiki- 
pedia content, in particular, for the number of views of "to- 
day's featured articles". We introduce a model to describe 
this pattern and use this model to predict the popularity of 
promoted Wikipedia content. The model should be applica- 
ble to analyze and compare popularity patterns of promoted 
content on collaborative platforms in general.

Figure 1: English Wikipedia Main page on December 20, 2012.



The simplicity of producing and consuming online content
makes it difficult to predict how much attention Internet
users will devote to any promoted item. On the one hand, a
number of studies [19][20] try to address this question by
analyzing media-sharing platforms such as Digg or Youtube.
However, such platforms rank and categorize content based on
previous popularity and user votes, and this leads to a
rich-get-richer bias in the number of views and in the
duration of the promotion time. On the other hand, the
concept of content promotion on Wikipedia is distinctly
different. Promoted articles on Wikipedia are generated and
managed through online collaboration and shown to the online
audience for a fixed amount of time. This predefined exposure
duration gives Wikipedia promotion more in common with online
advertisement than with content popularity on other media
platforms.

The rest of the paper is organized as follows. In the next
section we briefly discuss our main contributions. Then, in
Section 3 we describe the data sets we use in this study. In
Section 4 we focus on the general statistics of Wikipedia
traffic and compare the article-view data with the editing
and commenting activities. In the same section we also look
at the average promoted article popularity. Next, in Section 5
we introduce the model to describe the number of views during
the exposure of a promoted article. In Section 6 we use this
model to predict the number of views during the exposure
time of an article. Finally, we discuss the related studies in
Section 7 and present our conclusions in Section 8.

2. OUR CONTRIBUTION 

In this work we aim to explore temporal and popularity
patterns on the English Wikipedia. In particular, our primary
interest lies in the number of views a page receives per
hour. We first focus on the overall interest of Internet
users in Wikipedia. We analyze total view statistics for the
English Wikipedia, and argue that these values follow daily
and weekly cycles. Second, we compare the number of Wikipedia
views with the number of comments and edits. The temporal
characteristics of the latter two measures have been analyzed
in [6]. We observe that the overall dynamics can be described
as "There are more and more readers of Wikipedia, but they
have less and less new to add" [1].

Moreover, given a predefined set of pages, which we describe
in Section 3, we analyze the daily correlations between
views, comments, edits and distinct editors. We find low but
positive correlations when only articles with non-zero
activities are considered.



3. DATASET 

We retrieve the page-view values from a database provided
by Wikimedia. In this database we find a file for every
hour, which lists the total number of views of each page
during that hour, provided it received at least one view. We
extract the page-view data between December 9, 2007 at
18h (UTC) and March 31, 2010 at 23h (UTC), for a total of 844
days. Note that this database is not entirely complete: for
some hours there is missing data (see Table 1 for the largest
time gaps) or there is more than one entry for a single
article. In the latter case we simply sum these entries.
Finally, we assume that views of articles which come through
redirects are also registered in the data of the target page.



Table 1: Missing data in page views (UTC).

Date                             Missing hours
March 3 - March 4 2008           18h - 16h
August 21 - August 22 2008       12h - 12h
September 21 - October 1 2009    17h - 0h
October 15 - October 16 2009     0h - 2h
January 23 - January 25 2010     3h - 1h



We also used the Wikipedia dump from March 12, 2010
(see [8] and [6] for more details) to extract the temporal data
of edits and comments for the Wikipedia articles. Combining
these datasets we select 871 395 articles that have been
commented on at least once in their history. These articles
accumulated $32 \cdot 10^9$ views in total, which is on
average $38 \cdot 10^6$ views per day.

4. PAGE VIEW STATISTICS 

We start with the description of the hourly and daily number
of views of the English Wikipedia. Next, we compare the
historical trends in the page views with the trends in the
number of edits and comments. Finally, we discuss the
popularity of a promoted article during the promotion period.

4.1 Circadian and weekly patterns 

In Figure 2 we observe the number of visits per hour to
the English Wikipedia in general and also to its Main page,
averaged by weeks and by days. We observe that the number of
views of the English Wikipedia varies between
$1.4 \cdot 10^6$ and $1.8 \cdot 10^6$, with an average of
$1.6 \cdot 10^6$ views per hour. For the Main page popularity
we find that there are on average 252 visitors per hour.
Interestingly, in Figure 2(a) we observe a significant jump
in the number of Main page views from 01h to 03h (UTC) on
Thursdays, which can be explained by the presence of some
extreme outliers in the Main page views in our dataset. If we
remove these outliers (red curve in Figure 2(a)) we observe a
weekly pattern similar to the overall weekly activity, which
decreases slightly during weekends.

6 http://dumps.wikimedia.org/other/pagecounts-raw/

(b) circadian patterns

Figure 2: Temporal patterns of the number of Wikipedia
views per hour in total and for its Main page.

Figure 2(b) depicts the circadian patterns of the Wikipedia
page views. We observe the lowest activity between 09h and
11h (UTC), corresponding to the night hours in the US (taking
US Central Time (UTC-6)). Similar circadian patterns, but for
editing activity on the English Wikipedia, were observed in
[22]. In the following section we discuss the relations
between editing, commenting and viewing activities in more
detail.

4.2 Views, edits and comments 

In [6] Kaltenbrunner and Laniado analyzed changes in the
global number of edits and comments in the English Wikipedia.
In particular, they confirmed the decreasing trend in editing
and commenting activities [17] and obtained a ratio of 6
comments per 100 edits. In Figure 3 we use this ratio to
display the global number of edits and comments per day on
one scale. In the same figure we also plot the corresponding
global number of daily views. We note that in the latter case
the gaps in the curve correspond to the data gaps (see
Table 1). We observe that the global number of views of
Wikipedia grows over time. Similar results were obtained in
[1], where the authors analyzed the view trends for some
selected page categories. Therefore, the use of Wikipedia is
growing at the same time as contribution to Wikipedia is
declining.

Figure 3: Evolution of the page views, comments
and edits over time.

Our next step is to look at the correlations between the
global number of views, edits and comments per day. To this
end we calculate Spearman's rank correlation coefficients
(see Table 2). We use Spearman's coefficient because the
values under analysis demonstrate heavy-tailed behavior and
also lie on different scales. We refer to [10] for more
arguments on why it is essential to use rank correlations for
heavy-tailed data. In Table 2 we see that the global number
of views per day and the corresponding number of edits and
comments are negatively correlated. This observation can be
explained by the differences in trends for these quantities:
the global number of article views is increasing, while the
other quantities are decreasing. If we exclude these trends
from the data by removing the best linear fit, we find that
the number of views is indeed positively correlated with the
number of edits and comments (values in brackets in Table 2).
Interestingly, removing the linear trends decreases the
correlation coefficient between the number of edits and
comments. Nevertheless, these quantities remain strongly
correlated.
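The effect of detrending on a rank correlation can be illustrated with a small sketch. It is not the code used for Table 2: the helper names (`ranks`, `spearman`, `detrend`) and the toy series are assumptions made for illustration. Two made-up daily series share the same day-to-day fluctuation but have opposite linear trends, so the raw Spearman coefficient is negative and the detrended one is positive.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def detrend(y):
    """Subtract the least-squares line fitted against the day index."""
    n = len(y)
    t = list(range(n))
    mt, my = sum(t) / n, sum(y) / n
    slope = (sum((a - mt) * (b - my) for a, b in zip(t, y))
             / sum((a - mt) ** 2 for a in t))
    return [b - (my + slope * (a - mt)) for a, b in zip(t, y)]


# Toy daily series (made up): views trend up, edits trend down,
# but both carry the same day-to-day fluctuation.
wiggle = [3, -2, 4, -5, 1, -3, 5, -1, 2, -4]
views = [100 + 10 * d + w for d, w in enumerate(wiggle)]
edits = [200 - 10 * d + w for d, w in enumerate(wiggle)]

raw_rho = spearman(views, edits)                         # dominated by trends
detrended_rho = spearman(detrend(views), detrend(edits))  # trends removed
```

In this toy case the opposite trends fully determine the raw ranks, while the shared fluctuation determines the ranks after detrending, which mirrors the sign change reported in Table 2.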



Table 2: Spearman's rank correlation coefficients for
the global number and the detrended global number
(in brackets) of views, edits, and comments.

              # comments      # edits
# views       -0.24 (0.40)    -0.22 (0.43)
# comments                    0.75 (0.53)



Finally, we look at the relation between the number of
views and the number of edits and comments for each article
separately. To this end, for each day we calculate the
corresponding correlation coefficients for the number of
views, edits, comments, and distinct editors based on the
values observed for the pages from a specific article set. In
the complete set we include all articles that have received
at least 1 comment in their history. Then, for each day and
for each pair of characteristics we also construct OR and AND
sets as follows: we take only pages for which at least one
(OR) or both (AND) of the selected characteristics are
non-zero on the selected day. We find that these daily
correlation coefficients are quite stable over time (plots
are not shown), with the average values reported in Table 3.
We observe that, in spite of the fact that every edit and
comment implies a view, the number of views per article
during a selected day is only weakly correlated with the
number of edits or editors. The correlation between the
number of views and comments is even lower. However, the
correlation is strongest for the AND set in all three cases.
This indicates a stronger connection between these quantities
for articles with larger edit and comment activity.
Interestingly, we observe a negative correlation between the
number of comments and edits per day for the OR set. This may
indicate that comments and edits are made at different times
in articles with low activity.

Table 3: Spearman's rank correlation coefficients for
the number of views, edits, and comments an article
receives per day (complete / OR / AND sets).

             # edits   # editors   # comments   set
# views      0.20      0.20        0.04         complete
             0.19      0.19        0.04         OR
             0.29      0.36        0.16         AND
# edits                0.99        0.07         complete
                       0.74        -0.24        OR
                       0.74        0.23         AND
# editors                          0.07         complete
                                   -0.28        OR
                                   0.21         AND
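The construction of the complete, OR and AND sets for one day and one pair of characteristics can be sketched as follows. The function name `select_pages` and the counts are hypothetical and serve only to illustrate the selection rule described in the text; the rank correlation would then be computed on the selected pages only.

```python
def select_pages(x, y, mode):
    """Return indices of pages kept for one day and one pair of characteristics.

    x, y: per-page daily counts (e.g. edits and comments), aligned by index.
    mode: "complete" keeps all pages, "OR" keeps pages where at least one
    count is non-zero, "AND" keeps pages where both counts are non-zero.
    """
    if mode == "complete":
        return list(range(len(x)))
    if mode == "OR":
        return [i for i in range(len(x)) if x[i] > 0 or y[i] > 0]
    if mode == "AND":
        return [i for i in range(len(x)) if x[i] > 0 and y[i] > 0]
    raise ValueError("unknown mode: " + mode)


# Made-up counts for five pages on a single day.
edits = [0, 3, 0, 1, 0]
comments = [0, 0, 2, 4, 0]

complete_set = select_pages(edits, comments, "complete")
or_set = select_pages(edits, comments, "OR")
and_set = select_pages(edits, comments, "AND")
```

In the OR set, pages with exactly one non-zero value pair a positive count with a zero, which is one way a negative rank correlation such as the one between edits and comments can arise.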

4.3 Promoted articles 

In the previous sections we analyzed the temporal
characteristics of activities on Wikipedia articles in
general. In the rest of this work we focus on the page-view
data for the promoted articles only. Recall that on Wikipedia
an article gets promoted for a predefined period of 1+3 days,
which we call the exposure duration in analogy to [20]. We
restrict our analysis and predictions to these article
exposure durations.

We select all the articles promoted in the time span from
January 1st, 2008 through March 31st, 2010, which is a total
of 822 articles. Among them only 686 have complete page-view
data, i.e. we know the number of views for every hour in
their exposure duration. We also omit two more articles
(Barack Obama and John McCain), which were both promoted on
November 4th, 2008. These articles, with the largest number
of views during the second day of exposure, show completely
different dynamics (see Figure 4) compared to the average
article (black curve). These pages would therefore influence
some of the results reported below. Thus, in this study we
use only 684 promoted articles.

Figure 4: Progression of the number of views for the
promoted articles Barack Obama (divided by 2.5 for
scaling), John McCain, and average article during
their exposure time.

Figure 5: Average normalized popularity of the promoted
articles during the exposure period.

By popularity of a promoted article we mean the number
of views this article receives during the exposure duration.
In Figure 5 we show the average normalized popularity of
a promoted page. For every promoted article we first find the
total number of views this article has received by the end of
the fourth day, and then we use this number to obtain a time
series of popularity that monotonically increases from 0 at
the moment of the article promotion to 1 at the end of the
exposure duration. This approach allows us to eliminate the
differences in specific interestingness among the articles.
We observe a clear difference in the dynamics of popularity
between the first day and the rest of the exposure days. In
both cases we see an approximately linear increase of
popularity, which is very different from the strictly concave
results for Digg and Youtube reported in [19].
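The normalization described above can be sketched in a few lines. The function names and the four-hour toy series are assumptions for illustration only; a real exposure series would contain one value per hour of the four-day exposure.

```python
def normalized_popularity(hourly_views):
    """Cumulative views divided by the total over the exposure duration.

    The resulting curve rises monotonically and ends at 1.
    """
    total = float(sum(hourly_views))
    curve, running = [], 0
    for v in hourly_views:
        running += v
        curve.append(running / total)
    return curve


def average_curve(curves):
    """Point-wise mean of equally long normalized curves."""
    n = len(curves)
    return [sum(c[t] for c in curves) / n for t in range(len(curves[0]))]


# Made-up hourly views for two short "exposures".
curve_a = normalized_popularity([4, 3, 2, 1])
curve_b = normalized_popularity([10, 10, 10, 10])
mean_curve = average_curve([curve_a, curve_b])
```

Because each curve is divided by its own total, an article with ten times more views contributes to the average with the same weight as a less popular one.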

In Figure 6 we plot the average number of views a promoted
article attracts during the $t$-th hour, $v_t$,
$t = 1, \dots, 95$. We clearly see that the exposure period of
a promoted article on Wikipedia can be divided into four
stages. In the first stage, the first hour after a page gets
promoted, we witness a huge increase in the article's
popularity. The value $v_1$ is even the largest for the
average promoted article. The second stage contains the
remaining hours of the first day of the promotion. The third
stage is characterized by the negative jump occurring after
the original article gets replaced by the new one. Finally,
the last stage contains the view dynamics during the 3 days
of being promoted in "Recently featured". Using this stage
representation, we construct $g(t)$ as a piecewise-linear
approximation of $\log(v_t)$ and plot it in Figure 6. We refer
to the Appendix for more details.

Comparing the approximation $g(t)$ and the promoted article
popularity $v_t$, we notice that the main differences are
caused by the circadian patterns of Wikipedia views. In order
to remove this variation we use a new time scale in which
every hour is measured in the number of views rather than in
minutes. This approach was introduced in [19] for the
popularity of Digg stories. We modify the original idea by
removing a constant fraction $c$ of the traffic data to
emphasize the circadian patterns even more. In the rest of
the paper we call the new time scale the redistributed time
scale and set one hour in this scale to be equal to 1/24 of
the product of $(1 - c)$ and the average number of Main page
views per day. We again refer to the Appendix for more
details. In Figure 6 we plot the average number of views of
the promoted articles in the redistributed time scale
($c = 0.406$) and observe the linear decreasing trend of the
promoted article popularity with time.

7 This is the only occasion where 2 articles were promoted at
once, on account of the US presidential elections.

Figure 6: The average number of views of a promoted
article arranged by different time scales.
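One possible reading of the redistributed time scale is sketched below. The exact construction is given in the Appendix, so this is only an illustration under stated assumptions: "removing a constant fraction $c$" is interpreted here as subtracting $c$ times the mean hourly Main page traffic from every hour, and the function name `redistributed_time` and the toy traffic profile are hypothetical.

```python
def redistributed_time(main_page_views, c):
    """Map each original hour to its end point on the redistributed scale.

    Assumption: the removed traffic is a constant baseline equal to c times
    the mean hourly Main page views. One redistributed hour then equals
    (1 - c) times the mean hourly views, i.e. 1/24 of (1 - c) times the
    average daily views.
    """
    n = len(main_page_views)
    mean_hourly = sum(main_page_views) / float(n)
    unit = (1.0 - c) * mean_hourly
    baseline = c * mean_hourly
    elapsed, out = 0.0, []
    for v in main_page_views:
        # Hours with little traffic advance the redistributed clock slowly.
        elapsed += max(v - baseline, 0.0) / unit
        out.append(elapsed)
    return out


# Made-up daily profile: 12 quiet hours followed by 12 busy hours.
profile = [100] * 12 + [300] * 12
t_star = redistributed_time(profile, c=0.406)
```

Under this reading a full day still spans 24 redistributed hours, but busy hours are stretched and quiet hours are compressed, and a larger $c$ makes the contrast stronger.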

5. MODEL 

Based on the view behavior averaged over all promoted
articles, we propose a model which completely describes the
traffic dynamics of a selected promoted article during its
exposure duration. This model is defined by two parameters: a
constant interest-decay factor for all days of the promotion
and a negative jump of the popularity after the first day of
the exposure. The number of views the selected article
receives during the first hour of the promotion is used as
the only input value of the model.

5.1 Model definition 

In this section we define the model which explains the
distribution of the number of views per time unit of a
promoted article during the exposure duration. We start by
describing the specific shape of this distribution in
redistributed time (removing the circadian cycle). To this
end we set $w_{1^*} := 1$ and define the normalized number of
views for the promoted article at any redistributed time
$t^*$, $t^* = 2, \dots, 95$, as

$$w_{t^*} = \beta_{t^*} \cdot w_{t^*-1}.$$

Previously, we have described the four stages of the
exposure life of a promoted article on Wikipedia. Using these
definitions we set the temporal factor $\beta_{t^*}$ as
$\beta_{t^*} = \gamma$ for $t^* = 25$ and
$\beta_{t^*} = \beta$ for all other $t^*$, i.e. for
$2 \le t^* \le 24$ and $26 \le t^* \le 95$. The constant
factor $\beta$ models the decay of the number of page views
in a typical hour of the exposure duration, i.e. while the
page is promoted on the Main page. The factor $\gamma$ stands
for the negative jump in the number of views after the
promoted article gets moved to the "Recently featured"
position. Thus, we model the shape of the article popularity
by stage: the first stage of the promoted article is
characterized by the initial value, the second by the
interest-decay factor $\beta$, the third by $\gamma$, and the
fourth again by the same factor $\beta$. To summarize, we
model the normalized number of views of the promoted article
during the $t^*$-th redistributed hour as follows:

$$w_{t^*} = \begin{cases} \beta^{t^*-1}, & \text{for } 2 \le t^* \le 24; \\ \gamma \beta^{t^*-2}, & \text{for } 25 \le t^* \le 95. \end{cases}$$

Figure 7: Explanation of the promoted article popularity
model.
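The piecewise definition of the normalized views can be checked against the recursion with a short sketch. The function names and the values of $\beta$ and $\gamma$ used here are illustrative assumptions, not the estimates of this paper.

```python
def shape_closed_form(beta, gamma, hours=95):
    """Normalized views w for redistributed hours 1..hours (index 0 is hour 1)."""
    w = [1.0]
    for t in range(2, hours + 1):
        if t <= 24:
            w.append(beta ** (t - 1))           # decay during the first day
        else:
            w.append(gamma * beta ** (t - 2))   # one jump, then the same decay
    return w


def shape_recursive(beta, gamma, hours=95):
    """Same shape built from w[t] = factor * w[t-1], factor = gamma only at t = 25."""
    w = [1.0]
    for t in range(2, hours + 1):
        factor = gamma if t == 25 else beta
        w.append(factor * w[-1])
    return w


# Illustrative parameter values only.
closed = shape_closed_form(beta=0.99, gamma=0.3)
recursive = shape_recursive(beta=0.99, gamma=0.3)
max_gap = max(abs(a - b) for a, b in zip(closed, recursive))
```

The exponent $t^* - 2$ in the second branch reflects that the factor $\beta$ is applied in every hour except hour 25, where $\gamma$ is applied instead.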

Finally, we use the reverse time-redistribution to find
$w_t$, i.e. the number corresponding to $w_{t^*}$ but in the
original time scale. We define the number of views of the
promoted article during the $t$-th hour as

$$v_t = v_{1^*} \cdot w_t,$$

where $v_1$ and $v_{1^*} = v_1 / w_1$ are the numbers of views
of the promoted article after the first hour of exposure in
the original and redistributed time scales. In Figure 7 we
illustrate the model.

The introduced model uses only the number of views during
the first hour of the exposure period, $v_1$, as an input
parameter. In Figure 8 we plot the histogram of the $v_1$
values in our dataset together with the log-normal fit
($\mu = 7.63$ and $\sigma = 0.71$). In the next section we
focus on the estimation of the parameters $\beta$ and
$\gamma$ which define our model.

5.2 Estimation 

We estimate the model's parameters $\beta$ and $\gamma$ by
using page-view data of the promoted articles on the English
Wikipedia. We apply the estimation algorithms to two sets of
promoted articles: the first set $S_1$ contains all 684
articles, and the second set $S_2$ contains the first 100
promoted articles ordered by the date of promotion. We use
$S_1$ in order to describe the general view dynamics for
promoted content on Wikipedia. We use $S_2$ in Section 6 to
predict the popularity of the remaining 584 promoted articles
given the estimated $\beta$ and $\gamma$ and the
corresponding initial values $v_1$.

Figure 8: Histogram of occurring values for $v_1$.

We denote by $v_{t^*}(\beta, \gamma)$ the predicted number
of views for given values of $\beta$ and $\gamma$ at
redistributed time $t^*$, and by $v^s_{t^*}$ the actual number
of page views at time $t^*$ for some promoted article
$s \in S$, where $S$ is either $S_1$ or $S_2$. Then, we
calculate the parameters $\beta$ and $\gamma$ that minimize
the error:

$$\{\beta, \gamma\} = \arg\min_{\beta, \gamma} \sum_{s \in S} \sum_{t^*} \left[ \log(v_{t^*}(\beta, \gamma)) - \log(v^s_{t^*}) \right]^2.$$

This yields $\beta = 0.9915$ and $\gamma = 0.2805$ for $S_1$,
and $\beta = 0.9960$ and $\gamma = 0.4174$ for $S_2$.
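A minimal sketch of such a fit is given below. It assumes that the error is the sum of squared differences of logarithms, that all series are already expressed in redistributed time, and that each series is normalized by its own first-hour value; under these assumptions the problem is linear in $\log(\beta)$ and $\log(\gamma)$ and has a closed-form solution. The function names and the synthetic data are hypothetical and this is not the estimation code used for the values reported here.

```python
import math


def fit_beta_gamma(series):
    """Least-squares fit of beta and gamma in log space.

    series: list of per-article lists of positive hourly views in
    redistributed time, where element 0 is the first hour.
    Model: log(v_t / v_1) = x_t * log(beta) + d_t * log(gamma), with
    x_t = t - 1 and d_t = 0 for t <= 24, x_t = t - 2 and d_t = 1 otherwise.
    """
    sxx = sxd = sdd = sxy = sdy = 0.0
    for v in series:
        for t in range(2, len(v) + 1):
            d = 1.0 if t >= 25 else 0.0
            x = float(t - 2 if t >= 25 else t - 1)
            y = math.log(v[t - 1] / v[0])
            sxx += x * x
            sxd += x * d
            sdd += d * d
            sxy += x * y
            sdy += d * y
    # Solve the 2x2 normal equations for a = log(beta), b = log(gamma).
    det = sxx * sdd - sxd * sxd
    a = (sxy * sdd - sxd * sdy) / det
    b = (sxx * sdy - sxd * sxy) / det
    return math.exp(a), math.exp(b)


def synthetic_series(v1, beta, gamma, hours=95):
    """Noise-free views generated from the model (for checking the fit only)."""
    out = [v1]
    for t in range(2, hours + 1):
        w = beta ** (t - 1) if t <= 24 else gamma * beta ** (t - 2)
        out.append(v1 * w)
    return out


# Made-up articles generated with known parameters.
data = [synthetic_series(1000.0, 0.99, 0.3), synthetic_series(2500.0, 0.99, 0.3)]
beta_hat, gamma_hat = fit_beta_gamma(data)
```

On noise-free synthetic data the fit recovers the generating parameters; on real data the residuals would contain whatever circadian variation is left after the time redistribution.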

So far we have assumed that $\gamma$ can be modeled as a
constant factor for all articles. However, since $\gamma$
encodes the negative jump in the decay of user interest after
the first day of the exposure, we suggest that it should be
correlated with the overall popularity of the promoted
article. To this end, we compared $\gamma$ with the total
number of views a promoted article receives during the day
before the promotion date, but found no correlation between
them (data not shown). We therefore propose to define
$\gamma$ as a function of the initial popularity $v_1$. We
first find that $\log(v_1)$ and $\log(\gamma)$ are negatively
correlated (Pearson's correlation coefficient is -0.31 for
$S_1$ and -0.32 for $S_2$), which is also indicated in
Figure 9. Then, we derive a log-linear function for $\gamma$:

$$\gamma = C \cdot v_1^m$$

based on the observations $\{\log(v_1), \log(\gamma)\}$ for
the articles from set $S$, where $S$ is again either $S_1$ or
$S_2$. We rewrite the last equation in the following form:

$$\log(\gamma) = h(v_1) = m \cdot \log(v_1) + \log(C). \qquad (1)$$

Using the set $S_1$ we obtain $m = -0.29$ and $C = 2.09$
for all articles. We note that for the estimation of the
parameters $m$ and $C$ we omit the outliers, indicated as red
squares in Figure 9. We also perform the fitting of (1) on
$S_2$ and obtain $m = -0.28$ and $C = 1.90$. Note that,
although the initial estimates for $\gamma$ were very
different for $S_1$ and $S_2$, the parameters of $h(v_1)$ are
not. This can also be observed in the nearly overlapping
linear fits in Figure 9. This may be explained by different
mean initial views $v_1$ in $S_1$ and $S_2$.

Comparing the estimated values for $\log(\gamma)$ with
$h(v_1)$, we find that
$\log(\gamma) \sim \mathcal{N}(h(v_1), \sigma^2)$. Therefore,
we can derive an interval in which $\log(\gamma)$ would lie
with a given probability. We use
$[h(v_1) - \sigma, h(v_1) + \sigma]$ as this interval,
indicated by the dashed lines in Figure 9.

8 These are the articles Borobudur, Princess Beatrice of the
United Kingdom, Local Government Commission for England
(1992), West Indian cricket team in England in 1988
and Attachment theory.

Figure 9: Relations between $\gamma$ and $v_1$. The dashed
red lines indicate the interval
$[h(v_1) - \sigma, h(v_1) + \sigma]$.
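As an illustration, the jump factor and its interval can be computed from the first-hour views as follows. The values $m = -0.28$ and $C = 1.90$ are the fit on the first 100 promoted articles; the value of $\sigma$ is not stated in this section, so the one used below is purely hypothetical, as are the function name and the example $v_1$.

```python
import math


def log_gamma_interval(v1, m, C, sigma):
    """Return (h - sigma, h, h + sigma) with h = m * log(v1) + log(C)."""
    center = m * math.log(v1) + math.log(C)
    return center - sigma, center, center + sigma


# sigma = 0.5 and v1 = 2000 are assumed values for illustration only.
low, mid, high = log_gamma_interval(v1=2000.0, m=-0.28, C=1.90, sigma=0.5)
gamma_low, gamma_hat, gamma_high = (math.exp(x) for x in (low, mid, high))
```

Because the interval is symmetric on the logarithmic scale, it is asymmetric around the point estimate once it is mapped back to $\gamma$ itself.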

Back to the model, we can now calculate $w_{t^*}$ at time
$t^*$, $t^* = 2, \dots, 95$, as follows:

$$w_{t^*} = \begin{cases} \beta^{t^*-1}, & \text{for } 2 \le t^* \le 24; \\ C \cdot v_1^m \cdot \beta^{t^*-2}, & \text{for } 25 \le t^* \le 95; \end{cases} \qquad (2)$$

and then use the reverse time-redistribution to find $w_t$.
Using $w_t$ and

$$v_t = \frac{v_1}{w_1} \cdot w_t \qquad (3)$$

we can obtain the estimated hourly progression $v_t$ of the
page views for $t = 2, \dots, 95$.

6. POPULARITY PREDICTION 

As discussed above, we use the first 100 promoted articles
(ordered by the date of their exposure) to learn the
parameters $\beta$ and $\gamma$, and then apply our model to
predict the popularity of the remaining 584 promoted articles
from our dataset. Thus, for each of these articles we take
the article's popularity after the first hour, $v_1$, and use
Equations (2) and (3) with the parameters $\beta = 0.996$ and
$\gamma$ ($m = -0.28$ and $C = 1.9$) from the previous
section.

As we discuss below, for most of the promoted articles we
are able to obtain a good prediction for the page-view
dynamics during the first day of exposure. However, for the
remaining days the number of actual page views $v_t$ does not
always lie within the predicted interval
$[h(v_1) - \sigma, h(v_1) + \sigma]$ for $t = 25, \dots, 95$,
as we see in Figure 10. Thus, although in the 25th hour we
correctly predict the page popularity for 74% of the
articles, in general we observe a decreasing trend in the
percentage of correct predictions. This is caused by our
model underestimating the decline of interest (or
overestimating $\gamma$) and can be improved by introducing
an additional input parameter $v_{25}$, i.e. the value of the
promoted page popularity right after it is moved to the
"Recently featured" section, into our model.

Adjusting the prediction during the first hour of the
second day of the promotion with $v_{25}$ leads us to the
following

Here we again use the reverse time-redistribution to find
$w_t$ to obtain the predicted hourly page-view progression
$\hat{v}_t$, for $t = 2, \dots, 95$, by calculating

In Figure 11 we present two examples of the prediction
of the popularity for both of the above-defined prediction
methods. We show the initial prediction in red, the interval
$[h(v_1) - \sigma, h(v_1) + \sigma]$ as a dark grey area, and
$\hat{v}_t$ based on $v_1$ and $v_{25}$ in blue. While the
prediction for the article Styracosaurus performs well
already using only $v_1$, a similar prediction overestimates
the views of the article Alice in Chains.

We analyze the normalized hourly errors for all articles
under study for both prediction methods: errors using just
$v_1$ are plotted in red, while errors using both $v_1$ and
$v_{25}$ are plotted in blue. From Figure 12 we observe that
our prediction performs well for the first day of exposure.
We recall that for this time interval we only use $v_1$ for
the prediction. For the second, the third and the fourth days
we observe an increase of the spread of hourly errors.
However, this increase is much smaller for the second
prediction technique.

In Figure 13(a) we present the distribution of the maximum
normalized hourly error of the predictions: both
distributions are right-skewed. For the method which only
uses $v_1$ as an input we observe more large overestimates.
The distribution of the minimum (maximum negative) normalized
hourly error is displayed in Figure 13(b). We observe that it
is approximately normally distributed for both prediction
methods, but shows larger underestimates for the method using
$v_1$ and $v_{25}$.

Finally, we present the absolute hourly errors
$(\hat{v}_t - v_t)$ in Figure 14. Interestingly, we observe
that the absolute error towards the end of day 1 is large.
This is caused by the fact that we model the drop as
occurring during one specific hour, whereas for most articles
it actually starts a few hours before the end of the first
day of the exposure duration. We also see that the hourly
errors during the second, the third and the fourth days are
slightly increasing and follow a circadian pattern. This is
similar to the observation in Figure 12. Again, the
prediction method which uses both $v_1$ and $v_{25}$ as input
outperforms the model that only uses $v_1$.

Figure 11: Examples of the prediction of the page-views
for a promoted article: (a) Styracosaurus; (b) Alice in
Chains (August 27, 2009).

Figure 12: Hourly normalized errors.

Figure 13: Distribution of the normalized maximum
and minimum errors of actual page-views; panel (b) shows the
minimum (maximum negative) normalized error.

Figure 14: Hourly errors of actual page-views.



7. RELATED WORK 

There is a vast literature analyzing different aspects of 
Wikipedia, e.g. see [12^ for an overview. In this section 
we discuss only studies that are closely related to our work. 
In [16] the authors provided a high-level overview of
Wikipedia traffic for 2009 with a particular focus on content
type. In [15] Ratkiewicz et al. analyzed the Wikipedia
page-view data for a thirteen-month span in 2008-2009. In
particular, they reported a heavy-tailed distribution for the
number of views per page. The authors also argued that the
top bursty articles in their dataset, where "burstiness" was
defined as the ratio of an article's traffic on the present
day to that on the previous day, can be divided into two
sets. In the first set the articles' traffic was influenced
by external events and correlated with Google Trends results.
In the second set the pages accrued their traffic due to
internal Wikipedia dynamics and, in particular, this traffic
was correlated with the hits that neighbors of these pages
received. In [11] the authors explored relations between
external factors and the Wikipedia traffic by using the
article-view data for movies to predict the opening box
office takings of blockbuster movies.

In [6] Kaltenbrunner and Laniado related the peaks in 
editing and commenting on the Barack Obama article on 
Wikipedia to the corresponding political and social events. 
They also found that some of the peaks in these activities 
were due to some internal Wikipedia dynamics. The circa- 
dian patterns for editing behavior for the different language 
Wikipedias were recently investigated in 

A number of studies focused on promoted content and 
news popularity. In [2] the authors reported power laws in the 
distribution of views of short-lived events, such as news 
on a large Hungarian news portal. They also found that 
for most of the news items, the number of views decays 
significantly 36 hours after posting. 

The log-normal distribution was found to describe the 
number of comments per minute that a news story on Slashdot 
receives [8]. The authors combined this fact with the 
circadian activity cycle of the website to predict the 
number of comments a news post on Slashdot receives per 
time unit. A similar idea based on rescaling average access 
patterns was also used in [18] to predict the long-term 
popularity of content on YouTube and Digg. The latter study 
introduced the notion of Digg-time, which we also use in this 
study. 

In [21] Wu and Huberman confirmed the log-normal nature 
of news story popularity on the social news portal 
Digg. They also proposed a stochastic dynamic model 
which defines the page-view increase as the product of a 
random variable, a factor that converges to zero, and the 
previous popularity. The expected value of the random 
variables models the fraction of people who would spread 
the news story to their social neighbors. The stochastic 
model was later extended to Reddit and Epinions [20] and 
YouTube [19], extending the work of [18]. Similar to our 
model, these studies used a single value as the baseline for 
their prediction and a parameter to model the decay of 
interest. However, in contrast to our model, they added 
an extra random variable to account for randomness in the 
data they extracted. 
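
The multiplicative model described in this paragraph can be written compactly as follows. This is only a sketch: the symbols $N_t$ (popularity at step $t$), $X_t$ (the random variable) and $r_t$ (the factor converging to zero) are notation assumed here for illustration and are not taken from the text above. 

```latex
N_t - N_{t-1} = r_t \, X_t \, N_{t-1}
\quad\Longleftrightarrow\quad
N_t = (1 + r_t X_t)\, N_{t-1},
\qquad r_t \to 0 \ \text{as}\ t \to \infty .
```

Iterating the recursion gives $N_t = N_0 \prod_{s=1}^{t} (1 + r_s X_s)$, so $\log N_t$ is a sum of random terms, which is the usual heuristic route to a log-normal popularity distribution. 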

8. CONCLUSIONS 

We have presented a simple yet powerful model for the 
view dynamics of promoted content on Wikipedia. The 
model shows that the number of views an article receives 
decays exponentially in time with a constant decay rate, if 
the dependency of the data on Wikipedia's circadian activity 
cycle is removed. The only exception to this decay rule 
is a larger drop when an article is moved 
from "Today's featured" to the list of "Recently featured" 
after 24h of being promoted. The model makes it possible to 
predict the popularity of an article using only the number of 
views it received during the first hour of exposure. The 
quality of the prediction can be improved if the model is 
updated right after an article is moved to the "Recently 
featured" section. 

Our model should make it possible to describe and compare view 
dynamics on other websites or parts of websites with similar 
update strategies, e.g. online newspapers which are updated 
on a daily basis, or a list of today's recommended items 
(mobile apps, products, etc.). The decay factor might be a 
useful parameter to account for the half-life of a piece of 
content on a given site. The findings might also be useful 
to predict the success rate of new online advertisements or 
sponsored content in general. 
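
As a worked illustration of how a decay factor translates into a half-life, assume that views decay as $v(t) = v(0)\, e^{-\lambda t}$ with $t$ in hours. The numbers below use the slopes of the piecewise-linear approximation given in the Appendix purely as example values of $\lambda$; they are not a half-life reported by this study. 

```latex
v(t_{1/2}) = \tfrac{1}{2}\, v(0)
\;\Longrightarrow\;
t_{1/2} = \frac{\ln 2}{\lambda},
\qquad
\lambda = 0.0102 \Rightarrow t_{1/2} \approx 68\ \text{hours},
\quad
\lambda = 0.0113 \Rightarrow t_{1/2} \approx 61\ \text{hours}.
```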

In this study we focused only on promoted articles, which 
provide a nearly ideal experimental setup to study content 
popularity. There is no competition between articles for 
the users' attention, as there is only one promoted article 
per day. Also, the time of promotion is periodic and fixed, 
which gives all articles exactly the same attention time span. 
However, this is not the only section on the Main page that 
can be investigated. Future work should extend the model 
to other sections with more complicated settings, such as 
"In the news", "On this day" and "Did you know". 

Furthermore, the occurrence of page-view peaks should be 
investigated. These findings could be compared to the 
results on peaks in editing and commenting behavior [6]. This 
would give more insight into patterns of sudden attention 
spikes of users in Wikipedia. 

Finally, we have observed a growing trend in page views 
on the English Wikipedia, whereas the number of edits and 
comments is decreasing. Thus it seems that, indeed, there 
are more and more readers of Wikipedia, but it becomes 
increasingly difficult to add something new. 

APPENDIX 

Piecewise-linear approximation 

Using the definition of the stages in the page-view dynamics 
of an average promoted article in English Wikipedia, we define 
a piecewise-linear approximation of the logarithm of the 
average number of views a promoted article attracts during 
the $t$-th hour as $g(t)$: 

$$g(t) = a_t + b_t\, t,$$ 

where $a_t = 0$, $b_t = 7.6749$ for $t = 1$; $a_t = 7.6851$, $b_t = -0.0102$ 
for $t = 2, \dots, 24$; $a_t = 45.1967$, $b_t = -1.5103$ for 
$t = 25$; and $a_t = 6.2234$, $b_t = -0.0113$ for $t = 26, \dots, 95$. 
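
The approximation above can be evaluated directly. The following is a minimal sketch that tabulates the coefficients exactly as listed; the names `SEGMENTS`, `g` and `approx_views` are illustrative choices, and the natural logarithm is assumed because the text refers to the approximation exp(g(t)). 

```python
import math

# Coefficients (first_hour, last_hour, a_t, b_t) as listed in the text.
# Hour ranges are inclusive.
SEGMENTS = [
    (1, 1, 0.0, 7.6749),
    (2, 24, 7.6851, -0.0102),
    (25, 25, 45.1967, -1.5103),
    (26, 95, 6.2234, -0.0113),
]


def g(t: int) -> float:
    """Piecewise-linear approximation g(t) = a_t + b_t * t of the
    logarithm of the average number of views in hour t."""
    for first, last, a, b in SEGMENTS:
        if first <= t <= last:
            return a + b * t
    raise ValueError("g(t) is only defined for t = 1, ..., 95")


def approx_views(t: int) -> float:
    """Approximate average number of views in hour t, i.e. exp(g(t))."""
    return math.exp(g(t))
```

For instance, `g(25) - g(26)` is about 1.51, which reflects the sharp drop in views that follows the end of the 24-hour promotion period. 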

Circadian patterns correction 

Since the total page-view activity in Wikipedia varies during 
the day, these circadian cycles influence the promoted page 
popularity and cause the differences between the observed 
values and the approximation $\exp(g(t))$. Following [19] we 
first use a redistributed time scale where one hour equals 
1/24 of the daily traffic of the Main page, $T$. We note 
that this value could be a constant ($T_{\mathrm{main}} = 6\,050$ for 
our dataset) or change every day, month, or any other time 
period. A new hour $t'$ is therefore the time interval which 
it takes the Main page to accumulate from $\frac{t'-1}{24}T$ to $\frac{t'}{24}T$ views. 

This "decycling" allows us to ignore the dependency of the 
promoted article's popularity on the total website traffic. 
In Figure 15 we plot the average number of views a promoted 
article attracts measured in the new time scale 
(Redistributed, c=0). 

After applying the new time scale we mitigate the dependence 
of the promoted page views on the time of the day; 
however, we do not remove it completely. A possible 
explanation is that the normalization by $T$ is not sensitive 
enough to the hourly changes in the traffic of the Main page. 
We propose to remove some constant fraction of the Main 
page view data before performing the redistribution in order 
to give more weight to these traffic changes. Thus, we denote 
by $m(t)$ the average number of Main page views for a given 
hour $t = 1, \dots, 24$. Recall that $T = \sum_{t=1}^{24} m(t)$ and define a 
new redistribution parameter $T^*$ as follows: 

$$T^* = \sum_{t=1}^{24} m^*(t) = \sum_{t=1}^{24} \left[ m(t) - c \min_{1 \le s \le 24} m(s) \right],$$ 

where $c = \arg\min_{c} \left[ \sum_{t^*} \left( \log(v_{t^*}(c)) - g(t^*) \right)^2 \right]$ and $v_{t^*}(c)$ 
is the number of views of an average promoted page at time 
$t^*$ in the new time scale defined by $T^*$. In Figure 15 we plot 
examples of the redistributions for different values of 
$c$. We find that $c = 0.406$ is the optimal value for "decycling" 
based on the Main page views. In other words, one 
needs to remove about 40% of the minimum hourly traffic of the 
Wikipedia Main page to make an optimal correction of the 
circadian patterns. 
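
The redistribution step can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the implementation used for the results above: it takes one day of hourly Main page view counts as a plain list, and the function names `corrected_traffic` and `redistributed_hours` are hypothetical. 

```python
from typing import List, Sequence


def corrected_traffic(m: Sequence[float], c: float) -> List[float]:
    """m*(t) = m(t) - c * min(m): remove a fraction c of the minimum
    hourly Main page traffic from every hour."""
    floor = c * min(m)
    return [x - floor for x in m]


def redistributed_hours(m: Sequence[float], c: float = 0.406) -> List[float]:
    """Position on the new time scale (in new hours) reached at the end
    of each clock hour. One new hour corresponds to T*/len(m) corrected
    views, i.e. T*/24 for a full day of hourly data."""
    m_star = corrected_traffic(m, c)
    t_star = sum(m_star)  # T* in the text
    if t_star <= 0:
        raise ValueError("corrected traffic must have a positive total")
    per_new_hour = t_star / len(m)
    elapsed: List[float] = []
    accumulated = 0.0
    for views in m_star:
        accumulated += views
        elapsed.append(accumulated / per_new_hour)
    return elapsed
```

With perfectly flat traffic the new scale coincides with clock time, while busy hours are stretched and quiet hours compressed otherwise. Selecting c itself would additionally require the promoted-article view data, by minimizing the squared difference between the logarithm of the redistributed views and g, which is not shown here. 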



Figure 15: Comparing different parameter values for 
improving the estimate. (And the function g(t) in 
the subplot) 



